Search CORE

25 research outputs found

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

Author: Bagon Shai
Dekel Tali
Geyer Michal
Tumanyan Narek
Publication venue
Publication date: 22/11/2022
Field of study

Large-scale text-to-image generative models have been a revolutionary breakthrough in the evolution of generative AI, allowing us to synthesize diverse images that convey highly complex visual concepts. However, a pivotal challenge in leveraging such models for real-world content creation tasks is providing users with control over the generated content. In this paper, we present a new framework that takes text-to-image synthesis to the realm of image-to-image translation -- given a guidance image and a target text prompt, our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text, while preserving the semantic layout of the source image. Specifically, we observe and empirically demonstrate that fine-grained control over the generated structure can be achieved by manipulating spatial features and their self-attention inside the model. This results in a simple and effective approach, where features extracted from the guidance image are directly injected into the generation process of the target image, requiring no training or fine-tuning and applicable for both real or generated guidance images. We demonstrate high-quality results on versatile text-guided image translation tasks, including translating sketches, rough drawings and animations into realistic images, changing of the class and appearance of objects in a given image, and modifications of global qualities such as lighting and color

arXiv.org e-Print Archive

SceneScape: Text-Driven Consistent Scene Generation

Author: Abecasis Amit
Dekel Tali
Fridman Rafail
Kasten Yoni
Publication venue
Publication date: 30/05/2023
Field of study

We present a method for text-driven perpetual view generation -- synthesizing long-term videos of various scenes solely, given an input text prompt describing the scene and camera poses. We introduce a novel framework that generates such videos in an online fashion by combining the generative power of a pre-trained text-to-image model with the geometric priors learned by a pre-trained monocular depth prediction model. To tackle the pivotal challenge of achieving 3D consistency, i.e., synthesizing videos that depict geometrically-plausible scenes, we deploy an online test-time training to encourage the predicted depth map of the current frame to be geometrically consistent with the synthesized scene. The depth maps are used to construct a unified mesh representation of the scene, which is progressively constructed along the video generation process. In contrast to previous works, which are applicable only to limited domains, our method generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles.Comment: Project page: https://scenescape.github.io

arXiv.org e-Print Archive

Revealing and modifying non-local variations in a single image

Author: Dekel Tali
Freeman William T.
Irani Michal
Michaeli Tomer
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/11/2015
Field of study

We present an algorithm for automatically detecting and visualizing small non-local variations between repeating structures in a single image. Our method allows to automatically correct these variations, thus producing an 'idealized' version of the image in which the resemblance between recurring structures is stronger. Alternatively, it can be used to magnify these variations, thus producing an exaggerated image which highlights the various variations that are difficult to spot in the input image. We formulate the estimation of deviations from perfect recurrence as a general optimization problem, and demonstrate it in the particular cases of geometric deformations and color variations.Israel Science Foundation (Grant 931/14)Shell Researc

DSpace@MIT

MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

Author: Bar-Tal Omer
Dekel Tali
Lipman Yaron
Yariv Lior
Publication venue
Publication date: 16/02/2023
Field of study

Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes. Project webpage: https://multidiffusion.github.i

arXiv.org e-Print Archive

TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Author: Bagon Shai
Bar-Tal Omer
Dekel Tali
Geyer Michal
Publication venue
Publication date: 23/07/2023
Field of study

The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: https://diffusion-tokenflow.github.io

arXiv.org e-Print Archive

MoSculp: Interactive Visualization of Shape and Time

Author: Dekel Tali
Freeman William T.
He Qiurui
Mueller Stefanie
Owens Andrew
Wu Jiajun
Xue Tianfan
Zhang Xiuming
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2018
Field of study

We present a system that allows users to visualize complex human motion via 3D motion sculptures---a representation that conveys the 3D structure swept by a human body as it moves through space. Given an input video, our system computes the motion sculptures and provides a user interface for rendering it in different styles, including the options to insert the sculpture back into the original video, render it in a synthetic scene or physically print it. To provide this end-to-end workflow, we introduce an algorithm that estimates that human's 3D geometry over time from a set of 2D images and develop a 3D-aware image-based rendering approach that embeds the sculpture back into the scene. By automating the process, our system takes motion sculpture creation out of the realm of professional artists, and makes it applicable to a wide range of existing video material. By providing viewers with 3D information, motion sculptures reveal space-time motion information that is difficult to perceive with the naked eye, and allow viewers to interpret how different parts of the object interact over time. We validate the effectiveness of this approach with user studies, finding that our motion sculpture visualizations are significantly more informative about motion than existing stroboscopic and space-time visualization methods.Comment: UIST 2018. Project page: http://mosculp.csail.mit.edu

arXiv.org e-Print Archive

Crossref

DSpace@MIT